\pagestyle{myheadings} \setcounter{page}{1} \setcounter{footnote}{0}

\section{~Setting up for distributed machines (MPI)} \label{app:mpi}
\newcounters 
\vssub
\subsection{~Model setup}
\vssub

In order to run \ws\ on a distributed memory machine using \mpi, two
requirements need to be met. First, all executables need to be compiled
properly, that is, with the appropriate \ws\ options (switches) and with the
appropriate compiler options. Second, the parallel version of the model needs
to be run in a proper parallel environment, that is, the parallel codes need
to be run on a multi-processor machine with the parallel environment of that
machine invoked correctly. These two issues are discussed in some detail
below.

Of all the \ws\ programs described in section~\ref{chapt:run}, only three
benefit from a parallel implementation with \mpi: the actual models {\file
ww3\_shel} and {\file ww3\_multi}, and the initial conditions program {\file
ww3\_strt}. {\file ww3\_strt} is typically not used in operational
environments, and can generally be run in single processor mode. The main
reason for running {\file ww3\_strt} in multi-processor mode is to reduce its
memory requirements. These three codes are the only codes that manipulate all
spectra for all grid points simultaneously, and hence require much more memory
than all other \ws\ programs. An added benefit of running these programs in
parallel (other than reduced run times) is that their memory use per processor
decreases as the number of processors is increased.

Considering the above, it is sufficient for most implementations on parallel
machines to compile only the main programs {\file ww3\_shel} and {\file
ww3\_multi} with the \mpi\ options. All other \ws\ programs with the exception
of {\file ww3\_strt} are designed for single-processor use. The latter
programs should not be run in a parallel environment, because this will lead
to I/O errors in output files. Furthermore, there is no possible gain in run
time for these codes in a parallel environment due to their design. Because
all programs share subroutines, it is important to ensure that this
compilation is done correctly, that is, that the subroutines and main programs
are compiled with compatible compiler settings. This implies that subroutines
that are shared between parallel and non-parallel programs should be compiled
individually for each application.

The first step for compiling the \mpi\ version of programs is to ensure that
the proper compiler and compiler options are used. Examples of this for an IBM
system using the xlf compiler and for a Linux system using the Portland
compiler can be found in the example {\file comp} and {\file link} scripts provided
with the distribution of \ws.
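
The essential difference between the serial and \mpi\ builds is typically the
compiler command that is invoked and the \mpi\ library that is linked in. As
an illustration only (the commands, flags, and shell variables below are
generic assumptions for a Linux system and are not taken from the distributed
scripts), the relevant fragments of a {\file comp} and {\file link} script
might look like:

\begin{verbatim}
# Illustrative only: serial versus MPI compiler invocations.
# Compiler names, flags, and shell variables are assumptions;
# adapt them to the local system.

# serial (shared memory) compile and link, e.g., Portland compiler
  pgf90 -O3 -c $name.f
  pgf90 -o $prog $objs

# MPI (distributed memory) compile and link via the MPI wrapper
  mpif90 -O3 -c $name.f
  mpif90 -o $prog $objs
\end{verbatim}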

The second step is to invoke the proper compile options (switches) in
compiling all parts of \ws. Most programs will be compiled for
single-processor use. To ensure that all subroutines are consistent with the
main programs to which they are linked, the compile procedure should be
divided into two parts.  A simple script that will properly compile all \ws\
programs is given in Fig.~\ref{fig:make_MPI}.  An expanded version of this
example is now available as \command{make\_MPI} or in \command{w3\_automake}. Alternatively, the commands
in the script can be run interactively, while directly editing the {\file
  switch} file when appropriate.

\begin{figure}
\begin{center} \begin{tabular}{|c|} \hline \\
\begin{minipage}{5in} 
\begin{verbatim}
#!/bin/sh

# Generate appropriate switch file for shared and 
# distributed computational environments

  cp switch switch.hold
  sed -e 's/DIST/SHRD/g' \
      -e 's/MPI //g'      switch.hold > switch.shrd
  sed 's/SHRD/DIST MPI/g' switch.hold > switch.MPI

# Make all single processor codes

  cp switch.shrd switch
  w3_make ww3_grid ww3_strt ww3_prep ww3_outf ww3_outp \
          ww3_trck ww3_grib gx_outf gx_outp

# Make all parallel codes

  cp switch.MPI switch
  w3_make ww3_shel ww3_multi

# Go back to a selected switch file

  cp switch.shrd switch
# cp switch.hold switch

# Clean up

  rm -f switch.hold switch.shrd switch.MPI
  w3_clean

# end of script
\end{verbatim}
\end{minipage} \\ \\ \hline
\end{tabular} \end{center}

\caption{Simple script to ensure proper compilation of all \ws\ codes in a
   distributed (\mpi) environment. This script assumes that the {\F shrd}
   switch is selected in the {\file switch} file before the script is run.}
   \label{fig:make_MPI}
%\botline
\end{figure}

An alternative way of consistently compiling the code is to first extract all
necessary subroutines per code using {\file w3\_source}, then put the sources
and the makefile in individual directories, and compile using the {\file make}
command. In this case the codes for {\file ww3\_shel} and {\file ww3\_multi}
are extracted using the appropriate \mpi\ switches, whereas all other codes
are extracted using the switches for the shared memory architecture.
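
As a rough sketch of this alternative, the sequence below illustrates the
idea for {\file ww3\_shel}; the exact calling convention of {\file
w3\_source} and the name of the archive it produces are assumptions here and
may differ between versions, so they should be checked in the local
installation:

\begin{verbatim}
# Sketch only: per-program source extraction and compilation.
# The w3_source invocation and archive name are assumptions.

# set the switch file for the distributed (MPI) environment first
  w3_source ww3_shel
  mkdir -p shel_src
  tar -xf ww3_shel.tar -C shel_src    # archive name is an assumption
  ( cd shel_src && make )
\end{verbatim}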

After all codes have been compiled properly, the actual wave models {\file
ww3\_shel} and {\file ww3\_multi} need to be run in the proper parallel
environment. The actual parallel environment depends largely on the computer
system used. For instance, on {\ncep}'s IBM systems, the number of processors
and the proper environment are set in `job cards' at the beginning of the
script. The code is then directed to the parallel environment by invoking it
as

\command{poe ww3\_shel}

\noindent
Conversely, on many Linux-type systems, the \mpi\ implementation includes the
{\file mpirun} command, which is typically used in the form

\command{mpirun -np \$NP  ww3\_shel}

\noindent
where the {\code -np \$NP} option sets the number of processes to be started
({\code \$NP} is a shell script variable with a numerical value); the hosts on
which these processes run are typically taken from a resource or machine file.
For details on running parallel codes on your system, please refer to the
system documentation or user support (if available).
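
For illustration, a minimal run fragment for a generic Linux cluster might
look as follows; the process count, output file name, and use of {\file
mpirun} are assumptions that must be adapted to the local queueing system and
\mpi\ implementation.

\begin{verbatim}
#!/bin/sh
# Minimal illustrative run fragment (not part of the distribution).

  NP=${NP:-4}       # number of processes, e.g., set by the scheduler
  mpirun -np $NP ./ww3_shel > ww3_shel.out 2>&1
\end{verbatim}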

Note that, as part of the parallel model setup, I/O options are available to
select between parallel and non-parallel file systems \citep[see
also][]{tol:MMAB03a}.


% -------------------------------------------------------------
\vssub
\subsection{~Common errors}
\vssub

Some of the most common errors made in attempting to run {\file ww3\_shel} and
{\file ww3\_multi} under \mpi\ are:

\begin{itemize}
\item Running in a parallel environment with a serial code (no
      \mpi\ in compilation). \\ \vspace{-3mm}

   This will result in corrupted data files, because all processes are
   attempting to write to the same file. This can be identified from the
   standard output of {\file ww3\_shel}. The proper parallel version of the
   code will produce each output line only once, whereas the non-parallel
   version will produce one copy of each output line for each individual
   process started (a simple check is sketched after this list).

\item Running programs other than the intended \mpi\ codes in a parallel
      environment. \\ \vspace{-3mm}

   This will result in corrupted data files, because all processes are
   attempting to write to the same file. This can be identified by the
   standard output of the programs, which will produce multiple copies of each
   output line.

\item {\file ww3\_shel} or {\file ww3\_multi} are compiled properly, but not
   run in a parallel environment. \\ \vspace{-3mm}

   On some systems, this will result in automatic failure of the execution of
   {\file ww3\_shel}. If this does not occur, the problem can only be traced by
   using system tools for tracking when and where the code is running.

\item During compilation, serial and parallel compiled subroutines are mixed.
      \\ \vspace{-3mm}

   This is the most common source of compiling, linking and run time errors
   of the code. Follow the steps outlined in the previous section to avoid
   this. 

\end{itemize}
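
The duplicated-output symptom mentioned in the first two items can be checked
quickly if the standard output has been captured in a file; the sketch below
assumes that the output of {\file ww3\_shel} was redirected to {\file
ww3\_shel.out}.

\begin{verbatim}
# Count repeated lines in the captured standard output. A properly
# compiled MPI code prints each output line once; a serial code
# started under MPI prints one copy per process, so repeated lines
# show a count equal to the number of processes.

  sort ww3_shel.out | uniq -c | sort -rn | head
\end{verbatim}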


%%%%%%%%%%%%

\vssub
\subsection{~MPI point-to-point communication errors}
\vssub

Running {\file ww3\_multi} in parallel with several large overlapping
grids involves a large number of concurrently active MPI point-to-point
communications (MPI send/recv pairs).  For correct execution, each active
MPI message must have a unique envelope (source, destination, tag,
communicator) with an allowed tag value.  In this context, two types of MPI
point-to-point communication errors may occur:
(1) the MPI message tag value exceeds an upper-bound or
(2) two or more MPI messages have the same envelope.
The first error may result in {\file ww3\_multi} crashing with
an MPI ``invalid tag'' error or an internal tag upper-bound exceeded error.
The second error may result in spectra sent from one MPI task to another being
delivered to the wrong location.  The second error is more difficult to detect
in that it is not trapped by MPI and may only be manifested as strange results
in model output.

To address these possible errors, the allowed ranges of MPI tags for the different
sets of point-to-point communication in {\file ww3\_multi} are controlled by the
$MTAGB$, $MTAG0$, $MTAG1$, $MTAG2$, and $MTAG\_UB$ parameters defined in {\file WMMDATMD}.
These parameters must satisfy
$MTAGB \geq 0$ and $7 \times NRGRD - 1 \leq MTAG0 < MTAG1 < MTAG2 < MTAG\_UB \leq MPI\_TAG\_UB$,
where $MPI\_TAG\_UB$ is the tag upper-bound for the MPI implementation.

The value of $MPI\_TAG\_UB$ for a specific MPI implementation can be obtained at
run-time using the $MPI\_COMM\_GET\_ATTR$ routine.
An MPI implementation is free to set the value of $MPI\_TAG\_UB$ larger than
the minimum set by the MPI standard ($32767 = 2^{15}-1$).
In the current release version of Open MPI, the value of $MPI\_TAG\_UB$ is
2147483647 ($2^{31}-1$).
On the Cray XC40 with Cray MPICH, the value of $MPI\_TAG\_UB$ is much smaller,
namely 2097151 ($2^{21}-1$).
Since this is the lowest value of $MPI\_TAG\_UB$ currently known amongst available
parallel platforms, the Cray XC40 value is used to set $MTAG\_UB$ in {\file WMMDATMD}.
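
The value of $MPI\_TAG\_UB$ on a given system can be checked with a small
stand-alone probe. The sketch below is not part of the \ws\ distribution and
assumes that an {\file mpif90} compiler wrapper and {\file mpirun} are
available.

\begin{verbatim}
#!/bin/sh
# Compile and run a small MPI probe that prints MPI_TAG_UB.

cat > probe_tag_ub.f90 << 'EOF'
program probe_tag_ub
  use mpi
  implicit none
  integer :: ierr
  logical :: flag
  integer(kind=mpi_address_kind) :: tag_ub
  call mpi_init ( ierr )
  call mpi_comm_get_attr ( mpi_comm_world, mpi_tag_ub, tag_ub, flag, ierr )
  if ( flag ) write (*,*) 'MPI_TAG_UB = ', tag_ub
  call mpi_finalize ( ierr )
end program probe_tag_ub
EOF

mpif90 -o probe_tag_ub probe_tag_ub.f90
mpirun -np 1 ./probe_tag_ub
rm -f probe_tag_ub probe_tag_ub.f90
\end{verbatim}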

If an MPI tag value exceeds the upper-bound ($MPI\_TAG\_UB$) imposed by the MPI
implementation, {\file ww3\_multi} may crash with an MPI ``invalid tag'' error.
If an MPI tag value exceeds one of the internal tag upper-bounds, {\file ww3\_multi}
will crash with error code 1001 and a report of which tag upper-bound was exceeded.
What follows is a description of the allowed MPI tag ranges and how they are set.

The $MTAGB$ parameter is only used as the tag lower-bound for blocking communication
that does not overlap with other point-to-point communication.
Hence it is sufficient to set $MTAGB$ to the lowest allowed MPI tag value of 0.

In addition to being the tag lower-bound for communication of internal boundary data in
{\file WMINIOMD}, the $MTAG0$ parameter is used in {\file WMIOPOMD} as the tag upper-bound
for the unified point output communication.  To ensure that the unified point
output communication tag values are $\geq 0$, $MTAG0$ must be at least $7 \times NRGRD - 1$.
A generous setting of $MTAG0 = 1000$ is used in {\file WMMDATMD}.

The allowed tag range for point-to-point communication of internal boundary data
({\file WMINIOMD:WMIOBS}) is $(MTAG0,MTAG1]$.
Given that this communication involves only the boundary points of the model
grids, a generous setting of $MTAG1 = 10000$ is used in {\file WMMDATMD}.

The allowed tag range for point-to-point communication from high rank to low rank grids
({\file WMINIOMD:WMIOHS}) is $(MTAG1,MTAG2]$.
The allowed tag range for point-to-point communication between equal rank model grids
({\file WMINIOMD:WMIOES}) is $(MTAG2,MTAG\_UB]$.
The high-rank-to-low-rank and equal-rank communications involve both
boundary and interior points of model grids.
Hence, the allowed tag ranges for these two communication sets should be
larger than the allowed range for the communication of internal boundary data.
In nested grid applications (e.g., a single global grid with several regional
nests), the required tag range for high-rank-to-low-rank communications will be
larger than the required tag range for the equal-rank communications.
The setting of $MTAG2 = 1500000$ is used in {\file WMMDATMD} to give a larger portion
of the total allowed range of tags for the high-rank-to-low-rank communications.
Other multiple grid applications may require adjusting $MTAG2$.
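
When such an adjustment is needed, the parameters can be located directly in
the {\file WMMDATMD} source file and the code recompiled; the path and file
name below are assumptions that depend on the \ws\ version and installation.

\begin{verbatim}
# Locate the tag-range parameters before adjusting them (the path is
# an assumption; the file may be named wmmdatmd.ftn or wmmdatmd.F90).

  grep -in 'MTAG' model/src/wmmdatmd.*
\end{verbatim}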


%\bpage \pagestyle{empty}
